Obi Thompson Sargoni
What are the main takeaways from this session?
“A network is, in its simplest form, a collection of points joined together in pairs by lines” (M.E.J. Newman, Networks: An Introduction)
There are many systems that can be usefully thought of as networks: for example, social networks such as Twitter, or transport networks such as the London Underground, both of which appear below.
Consider hypothetical Twitter data, where each row records whom that person follows: a 1 in row i, column j means account i follows account j.
dfTwitter <- data.frame("Bonnie" = c(0,0,0,1,1),
"Matt" = c(1,0,0,1,0),
"Elsa" = c(0,1,0,0,1),
"Neave" = c(1,0,0,0,0),
"Obi" = c(0,1,1,0,0),
row.names = c("Bonnie", "Matt", "Elsa", "Neave", "Obi"))
dfTwitter
## Bonnie Matt Elsa Neave Obi
## Bonnie 0 1 0 1 0
## Matt 0 0 1 0 1
## Elsa 0 0 0 0 1
## Neave 1 1 0 0 0
## Obi 1 0 1 0 0
Now create an iGraph graph object from this data
library(igraph)
graphTwitter <- graph_from_adjacency_matrix(as.matrix(dfTwitter))
plot(graphTwitter)
Graph
A graph is a mathematical representation of a network, with \(n\) nodes and \(m\) edges, as seen above. In this example, the nodes represent Twitter accounts and the edges represent following relationships between the accounts.
Nodes (Vertices)
V(graphTwitter)
## + 5/5 vertices, named, from d51e8aa:
## [1] Bonnie Matt Elsa Neave Obi
Edges
E(graphTwitter)
## + 9/9 edges from d51e8aa (vertex names):
## [1] Bonnie->Matt Bonnie->Neave Matt ->Elsa Matt ->Obi
## [5] Elsa ->Obi Neave ->Bonnie Neave ->Matt Obi ->Bonnie
## [9] Obi ->Elsa
Edge List
An equivalent representation of this network is the edge list, with edges pointing from the nodes in column 1 to the nodes in column 2.
get.edgelist(graphTwitter)
## [,1] [,2]
## [1,] "Bonnie" "Matt"
## [2,] "Bonnie" "Neave"
## [3,] "Matt" "Elsa"
## [4,] "Matt" "Obi"
## [5,] "Elsa" "Obi"
## [6,] "Neave" "Bonnie"
## [7,] "Neave" "Matt"
## [8,] "Obi" "Bonnie"
## [9,] "Obi" "Elsa"
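The two representations are interchangeable. As a minimal sketch (on a hypothetical three-node graph, not the Twitter data), an edge list can be turned back into an adjacency matrix:

```r
library(igraph)

# Hypothetical toy edge list: a -> b, b -> c
el <- matrix(c("a", "b",
               "b", "c"),
             ncol = 2, byrow = TRUE)
g <- graph_from_edgelist(el, directed = TRUE)

# The dense adjacency matrix recovered from the edge list
as.matrix(get.adjacency(g))
```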
Adjacency matrix
Another important representation of a network is the adjacency matrix.
In iGraph, the convention for adjacency matrix elements is as follows (some textbooks use the reverse convention):
\[A_{ij} = \begin{cases} 1, & \text{if there is an edge from node i to j}\\ 0, & \text{otherwise} \end{cases}\]
In this case, the dataframe used to create the network is essentially the adjacency matrix:
get.adjacency(graphTwitter)
## 5 x 5 sparse Matrix of class "dgCMatrix"
## Bonnie Matt Elsa Neave Obi
## Bonnie . 1 . 1 .
## Matt . . 1 . 1
## Elsa . . . . 1
## Neave 1 1 . . .
## Obi 1 . 1 . .
\(A_{Bonnie,Matt} = 1\)
graphTwitter['Bonnie','Matt'] # as.matrix(get.adjacency(graphTwitter))['Bonnie','Matt'] also works
## [1] 1
Directed Graph
A directed graph is a graph in which each edge has a direction, pointing from one node to another.
Directed graphs have asymmetric adjacency matrices.
isSymmetric(get.adjacency(graphTwitter))
## [1] FALSE
Bonnie is connected to Matt, but Matt is not connected to Bonnie: \(A_{Bonnie,Matt} \neq A_{Matt, Bonnie}\)
print(graphTwitter['Bonnie','Matt'])
## [1] 1
print(graphTwitter['Matt','Bonnie'])
## [1] 0
plot(graphTwitter)
With the information about our social network represented as a Graph, we can perform some analysis.
Node degree
Undirected graph: the number of links attached to a node, \(k_i = \sum_{j=1}^{n}A_{ij} = \sum_{j=1}^{n}A_{ji}\) (the two sums agree because the adjacency matrix is symmetric)
Directed graph: either the number of ‘in’ links or ‘out’ links attached to a node.
Recall: \[A_{ij} = \begin{cases} 1, & \text{if there is an edge from node i to j}\\ 0, & \text{otherwise} \end{cases}\]
In degree: number of links pointing to you \(k_{j}^{in} = \sum_{i=1}^{n}A_{ij}\)
Out degree: number of links you point to \(k_{i}^{out} = \sum_{j=1}^{n}A_{ij}\)
In degree of the Twitter Graph
degree(graphTwitter, mode = "in")
## Bonnie Matt Elsa Neave Obi
## 2 2 2 1 2
Out degree
degree(graphTwitter, mode = "out")
## Bonnie Matt Elsa Neave Obi
## 2 2 1 2 2
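As a sanity check, the degrees can be read straight off the adjacency matrix with base R: under the \(A_{ij}\) convention above, out-degree is a row sum and in-degree a column sum. A minimal sketch with the Twitter matrix entered by hand:

```r
# Adjacency matrix of the Twitter example
A <- matrix(c(0,1,0,1,0,
              0,0,1,0,1,
              0,0,0,0,1,
              1,1,0,0,0,
              1,0,1,0,0),
            nrow = 5, byrow = TRUE,
            dimnames = rep(list(c("Bonnie","Matt","Elsa","Neave","Obi")), 2))

rowSums(A)  # out-degree: Bonnie 2, Matt 2, Elsa 1, Neave 2, Obi 2
colSums(A)  # in-degree:  Bonnie 2, Matt 2, Elsa 2, Neave 1, Obi 2
```

These match the `degree(..., mode = "out")` and `degree(..., mode = "in")` results above.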
Paths
A path is a sequence of nodes such that each node is connected to the next node. For example: [‘Bonnie’, ‘Neave’, ‘Matt’, ‘Elsa’].
The shortest path between two nodes is the path connecting them that traverses the fewest links (or, for weighted graphs, has the lowest total weight).
[‘Bonnie’, ‘Neave’, ‘Matt’, ‘Elsa’] is a path from Bonnie to Elsa, but the shortest path is…
shortest_paths(graphTwitter, 'Bonnie', to = 'Elsa',
mode = "out", weights = NULL, output = c("vpath", "epath", "both"),
predecessors = FALSE, inbound.edges = FALSE)
## $vpath
## $vpath[[1]]
## + 3/5 vertices, named, from d51e8aa:
## [1] Bonnie Matt Elsa
##
##
## $epath
## NULL
##
## $predecessors
## NULL
##
## $inbound_edges
## NULL
or use mode = "in" to find the shortest path from the ‘to’ node to the ‘from’ node
shortest_paths(graphTwitter, 'Bonnie', to = 'Elsa',
mode = "in", weights = NULL, output = "vpath",
predecessors = FALSE, inbound.edges = FALSE)
## $vpath
## $vpath[[1]]
## + 3/5 vertices, named, from d51e8aa:
## [1] Bonnie Obi Elsa
##
##
## $epath
## NULL
##
## $predecessors
## NULL
##
## $inbound_edges
## NULL
Connected
A graph is connected if it is possible to reach any node from any other node. Directed graphs can be ‘strongly’ connected (every node reachable from every other while respecting edge directions) or ‘weakly’ connected (reachable only if edge directions are ignored).
dfDisconnected <- data.frame("Bonnie" = c(0,1,0,0,0),
"Matt" = c(1,0,0,0,0),
"Elsa" = c(0,0,0,1,1),
"Neave" = c(0,0,0,0,1),
"Obi" = c(0,0,1,0,0),
row.names = c("Bonnie", "Matt", "Elsa", "Neave", "Obi"))
graphDisconnected <- graph_from_adjacency_matrix(as.matrix(dfDisconnected))
plot(graphDisconnected)
is.connected(graphDisconnected)
## [1] FALSE
The Twitter graph is strongly connected (which means it is also weakly connected)
is.connected(graphTwitter, mode = "strong")
## [1] TRUE
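The strong/weak distinction can be seen on a minimal hypothetical example: two nodes joined by a single directed edge.

```r
library(igraph)

# a -> b: b is reachable from a, but a is not reachable from b,
# so the graph is weakly but not strongly connected
g <- make_graph(c("a", "b"), directed = TRUE)

is.connected(g, mode = "weak")    # TRUE
is.connected(g, mode = "strong")  # FALSE
```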
Diameter
The diameter of a network is the length of the longest shortest path between any pair of nodes for which a path exists.
For a disconnected graph, the diameter of the largest connected component is given; this avoids infinite diameters.
diameter(graphTwitter, directed = TRUE, unconnected = TRUE, weights = NULL)
## [1] 3
Use GIS data of the London Underground to produce a graph representing the tube network.
Stations -> Nodes; Connections between stations -> Edges
Read the shapefile data: linestrings with each end being the coordinates of a station.
library(rgdal)  # provides readOGR()
dir_data <- ".\\data\\underground"
sfTube <- readOGR(dsn = dir_data, "underground")
## OGR data source with driver: ESRI Shapefile
## Source: "C:\Users\Obi\Desktop\CASA\Summer School 2019\SS-ASM19-Materials\networks\data\underground", layer: "underground"
## with 410 features
## It has 6 fields
## Integer64 fields read as strings: station_1 station_2
The data contain line strings between the coordinates of tube stations
head(sfTube@lines)
## [[1]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
## [,1] [,2]
## [1,] -0.1571 51.5226
## [2,] -0.1631 51.5225
##
##
##
## Slot "ID":
## [1] "0"
##
##
## [[2]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
## [,1] [,2]
## [1,] -0.1571 51.5226
## [2,] -0.1466 51.5234
##
##
##
## Slot "ID":
## [1] "1"
##
##
## [[3]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
## [,1] [,2]
## [1,] -0.1247 51.5080
## [2,] -0.1223 51.5074
##
##
##
## Slot "ID":
## [1] "2"
##
##
## [[4]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
## [,1] [,2]
## [1,] -0.1247 51.5080
## [2,] -0.1342 51.5098
##
##
##
## Slot "ID":
## [1] "3"
##
##
## [[5]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
## [,1] [,2]
## [1,] -0.1679 51.5199
## [2,] -0.1631 51.5225
##
##
##
## Slot "ID":
## [1] "4"
##
##
## [[6]]
## An object of class "Lines"
## Slot "Lines":
## [[1]]
## An object of class "Line"
## Slot "coords":
## [,1] [,2]
## [1,] -0.1679 51.5199
## [2,] -0.1755 51.5154
##
##
##
## Slot "ID":
## [1] "5"
and attributes of the linestrings
head(sfTube@data)
## toid_seq station_1 station_1_ station_2 station_2_ distance
## 0 1 11 Baker Street 163 Marylebone 416.5861
## 1 2 11 Baker Street 212 Regent's Park 734.1736
## 2 3 49 Charing Cross 87 Embankment 179.5034
## 3 4 49 Charing Cross 197 Picadilly Circus 689.2898
## 4 5 82 Edgware Road (B) 163 Marylebone 441.2181
## 5 6 82 Edgware Road (B) 193 Paddington 727.2998
Create a graph from the data
Use the attribute data from the shape file to create a graph representing the tube network.
dfTube <- sfTube@data
dfTubeGraph <- data.frame(i = dfTube$station_1, j = dfTube$station_2, distance = dfTube$distance)
graphTube <- graph_from_data_frame(dfTubeGraph, directed = FALSE)
head(get.edgelist(graphTube))
## [,1] [,2]
## [1,] "11" "163"
## [2,] "11" "212"
## [3,] "49" "87"
## [4,] "49" "197"
## [5,] "82" "163"
## [6,] "82" "193"
Add node attribute to the graph
Process the data to get the coordinates, names, and ids of all stations
# fortify() (from ggplot2) extracts the x and y coordinates of the points that make up the linestrings, and the id of the linestring each point belongs to
library(ggplot2)
sfTube.f = fortify(sfTube)
# Create id column that matches the ids of the linestrings
dfTube$id <- as.numeric(as.character(dfTube$toid_seq)) - 1
# Join the coordinates to the data from the shape file using the id column
dfMergedTube <- merge(sfTube.f, dfTube, by.x = "id", by.y = "id")
# Separate out the first and second stations and stack these on top of one another in one long dataframe.
firstStation <- dfMergedTube[dfMergedTube$order == 1,]
secondStation <- dfMergedTube[dfMergedTube$order == 2,]
firstStation <- data.frame(id=firstStation$station_1, name = firstStation$station_1_, lat = firstStation$lat, long = firstStation$long)
secondStation <- data.frame(id=secondStation$station_2, name = secondStation$station_2_, lat = secondStation$lat, long = secondStation$long)
# Combine stations, dropping duplicate rows
dfStations <- unique(rbind(firstStation, secondStation))
head(dfStations)
## id name lat long
## 1 11 Baker Street 51.5226 -0.1571
## 3 114 Harrow & Wealdston 51.5925 -0.3351
## 4 3 Aldgate East 51.5154 -0.0726
## 6 15 Barking 51.5396 0.0810
## 8 17 Barons Court 51.4905 -0.2139
## 10 18 Bayswater 51.5121 -0.1879
Find the index at which each vertex appears in the dataframe of stations and use this to assign names and positions to the vertices.
# Since the vertices don't have names yet, V(graphTube)$name returns a vector of vertex ids
v_pos = match(V(graphTube)$name, as.character(dfStations$id))
# Use the vector of positions to get the node names and coordinates ordered the same as they are in the graph
vector_names <- dfStations$name[v_pos]
vector_coords <- dfStations[v_pos,]
graphTube <- set.vertex.attribute(graphTube, "name", index = V(graphTube), value = as.character(vector_names))
graphTube <- set.vertex.attribute(graphTube, "x", index = V(graphTube), value = as.numeric(vector_coords$long))
graphTube <- set.vertex.attribute(graphTube, "y", index = V(graphTube), value = as.numeric(vector_coords$lat))
Nodes now all have name, x, and y attributes
print(get.vertex.attribute(graphTube, "name", 11))
## [1] "Lambeth North"
print(get.vertex.attribute(graphTube, "x", 11))
## [1] -0.1115
print(get.vertex.attribute(graphTube, "y", 11))
## [1] 51.4991
This graph is undirected and connected.
print(isSymmetric(get.adjacency(graphTube)))
## [1] TRUE
print(is.connected(graphTube))
## [1] TRUE
The graph includes an edge attribute for each edge - ‘distance’ - taken from the distance between stations
get.edge.attribute(graphTube)$distance[1:10]
## [1] 416.5861 734.1736 179.5034 689.2898 441.2181 727.2998 954.9616
## [8] 698.0515 1537.3971 783.2256
Show the adjacency matrix with and without weights (topological vs weighted).
Where there are multiple edges between a pair of nodes, the inclusion of weights can be misleading: the weighted entry is the sum of the parallel edges’ weights.
get.adjacency(graphTube)[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'Baker Street', 'Charing Cross', 'Edgware Road (B)' ... ]]
##
## Baker Street . . . . . . . . . .
## Charing Cross . . . . 2 . . . . .
## Edgware Road (B) . . . . . . . . . .
## Elephant & Castle . . . . . . . . . .
## Embankment . 2 . . . . . . . .
## Harlesden . . . . . . . . . .
## Harrow & Wealdston . . . . . . . . 1 .
## Kensal Green . . . . . . . . . .
## Kenton . . . . . . 1 . . .
## Kilburn Park . . . . . . . . . .
get.adjacency(graphTube, attr = "distance")[1:10,1:10]
## 10 x 10 sparse Matrix of class "dgCMatrix"
## [[ suppressing 10 column names 'Baker Street', 'Charing Cross', 'Edgware Road (B)' ... ]]
##
## Baker Street . . . . . . . . . .
## Charing Cross . . . . 359.0068 . . . . .
## Edgware Road (B) . . . . . . . . . .
## Elephant & Castle . . . . . . . . . .
## Embankment . 359.0068 . . . . . . . .
## Harlesden . . . . . . . . . .
## Harrow & Wealdston . . . . . . . . 1785.108 .
## Kensal Green . . . . . . . . . .
## Kenton . . . . . . 1785.108 . . .
## Kilburn Park . . . . . . . . . .
From this we see there are two connections between stations 49 (Charing Cross) and 87 (Embankment). Check this in the data.
dfTube[dfTube$station_1 == 49,]
## toid_seq station_1 station_1_ station_2 station_2_ distance
## 2 3 49 Charing Cross 87 Embankment 179.5034
## 3 4 49 Charing Cross 197 Picadilly Circus 689.2898
## 276 277 49 Charing Cross 87 Embankment 179.5034
## 277 278 49 Charing Cross 151 Leicester Square 436.4846
## id
## 2 2
## 3 3
## 276 276
## 277 277
Degree distribution
Calculate degree distribution of the tube network
hist(degree(graphTube))
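iGraph can also return the degree distribution directly, rather than via hist(). A minimal sketch on a hypothetical ring graph, where every node has degree 2:

```r
library(igraph)

g <- make_ring(10)

# Relative frequency of degrees 0, 1, 2, ...
degree_distribution(g)  # 0 0 1: all ten nodes have degree 2
```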
Shortest paths and diameter
Recall that a connected graph is one in which every node can be reached from every other node.
Calculate the shortest path between two stations (crude, since the weights are distances and not travel times)
startNode = V(graphTube)["Stockwell"]
endNode = V(graphTube)["Mile End"]
shortest_paths(graphTube, startNode, to = endNode, weights = get.edge.attribute(graphTube)$distance, output = "vpath")
## $vpath
## $vpath[[1]]
## + 12/306 vertices, named, from d692d8b:
## [1] Stockwell Oval Kennington
## [4] Elephant & Castle Borough London Bridge
## [7] Bank Liverpool Street Aldgate East
## [10] Whitechapel Stepney Green Mile End
##
##
## $epath
## NULL
##
## $predecessors
## NULL
##
## $inbound_edges
## NULL
Calculate the diameter with and without using the distance as the weight.
print(diameter(graphTube, directed = FALSE, unconnected = FALSE, weights = NULL))
## [1] 38
print(diameter(graphTube, directed = FALSE, unconnected = FALSE, weights = get.edge.attribute(graphTube)$distance))
## [1] 71491.12
Plot the graph
graphTube=simplify(graphTube,remove.loops = T,remove.multiple = T,edge.attr.comb = "min")
plot(graphTube,vertex.size=3,vertex.color="red",vertex.label.cex=.1)
Resources from a workshop on networks and iGraph: https://kateto.net/wp-content/uploads/2016/01/NetSciX_2016_Workshop.pdf
M.E.J. Newman, Networks: An Introduction
Elsa Arcaute
Nodes within a network can play different roles and have different levels of importance. Centrality measures help determine what these roles are. There is no unique measure of importance, since importance is determined with respect to the specific situation or problem at hand: for example, one person might be very important in leading a group of people, while another is important in mediating conflict, and so forth.
In this section we will introduce degree, closeness and betweenness centrality.
Degree centrality
Degree was introduced in the previous section (1.3). Let us take an example of a social network, and see how it works.
I asked my then 8-year-old son Yaan to construct a small immediate network of the people important to his dad, David. How? Build a list of person 1 | person 2 | weight, where the weight is a number between 1 and 10 reflecting the importance of the relationship, with 10 the highest.
#file constructed by Yaan
file_network <- ".\\data\\network_David.csv"
#read and get network
dat=read.csv(file_network,header=TRUE)
head(dat)
## node1 node2 weight
## 1 Yaan David 10
## 2 Yaan Elsa 10
## 3 Yaan Abu 10
## 4 Elsa Abu 10
## 5 David Abu 9
## 6 Maxime Yaan 10
g_david=graph.data.frame(dat,directed=FALSE)
#let us plot it
plot(g_david)
#Plotting the network in this way does not give us enough information
#Now let us compute the degree centrality
deg_david=degree(g_david, v=V(g_david), loops = F, normalized = FALSE)
print(deg_david)
## Yaan Elsa David Maxime Valeria Fernandito
## 24 23 23 5 5 11
## Fernando Belli Isi Paul Ximena Pame
## 11 11 7 7 7 11
## Rocio Fer Esteban Jaquie Ana Anna
## 11 11 11 12 3 4
## Stephan Marco Leigh Trond Storm Marcello
## 7 4 6 6 6 4
## Abu
## 22
#we can plot the graph using the measure of degree
plot(g_david,vertex.size=deg_david/max(deg_david)*14,vertex.color=deg_david,vertex.label.cex=0.5)
title(main = "Social network of David seen by Yaan")
In this plot we can see who the most connected person is. Any surprises?
Now recall from the previous section that there are ‘in’ and ‘out’ degrees. Who is more important?
What if, instead of looking at how many connections you have, you look at how important those connections are? –> Eigenvector centrality. More elaborate algorithms, which consider not only the importance of your connections but also who points to you, are used to rank internet pages, for example PageRank.
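As a minimal hypothetical sketch of this idea, igraph’s page_rank() gives each node a score that depends on the scores of the nodes pointing at it, not just their count:

```r
library(igraph)

# Hypothetical directed graph: a and b both point at c, and c points back at a,
# so c collects importance from two sources and passes some on to a
g <- graph_from_literal(a -+ c, b -+ c, c -+ a)

page_rank(g)$vector
```

Here c should come out as the most important node, since it is the only one with two incoming links.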
Closeness centrality
Let \(d_{ij}\) be the geodesic distance (shortest path) between node \(i\) and \(j\). The mean geodesic distance of node \(i\) to all other nodes is given by: \[l_{i} = \frac{1}{n}\sum_{j}d_{ij}\] where \(n\) is the total number of nodes.
A node that is very close to most other nodes, and hence has a low mean geodesic distance, will be more influential than a node which is far away. This gives rise to the following definition of the closeness centrality of node \(i\): \[C_{i} = \frac{1}{l_{i}} = \frac{n}{\sum_{j}d_{ij}}\]
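The definition can be checked by hand on a tiny hypothetical path graph, comparing \(1/\sum_j d_{ij}\) (iGraph’s unnormalised version, which omits the factor \(n\)) against closeness():

```r
library(igraph)

# Hypothetical path graph a - b - c - d
g <- graph_from_literal(a - b - c - d)
d <- distances(g)   # matrix of geodesic distances d_ij

1 / rowSums(d)      # e.g. node a: 1 / (0 + 1 + 2 + 3) = 1/6
closeness(g)        # matches
```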
Let us compute the closeness centrality for each node in David’s network. Note that when using software it is important to always read the documentation: in the case of iGraph in R, for example, the measure is not divided by the total number of nodes unless normalisation is requested, and if normalised it is divided by \(n-1\).
The link to the documentation is here: https://igraph.org/r/doc/closeness.html
# given that the network is weighted let us introduce the weights.
v_weights=1/(E(g_david)$weight)
# Note that we inverted the weights, since igraph treats them as distances:
# the higher the original friendship value, the closer the two nodes should be
clos_david=closeness(g_david,weights = v_weights)
# the following two commands are just to choose colours for the nodes, a palette
normalised_clos_david=(clos_david-min(clos_david))/(max(clos_david)-min(clos_david))
palette=hsv(h=1-((normalised_clos_david*2/3)+1/3),s = 1,v=1)
# Now we can plot it
plot(g_david,vertex.color=palette,vertex.size=normalised_clos_david*15,vertex.label.cex=.5)
title(main = "Closeness centrality of David's network according to Yaan")
And now who is the most important person in the network according to closeness centrality? Is this surprising?
The measure computed above took the weights of the network into account. This means that two nodes that are not directly connected may be closer than two nodes that are connected by a link with a very low “friendship” value. If the weights are disregarded, the measure corresponds to the shortest path computed from the number of links between nodes.
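The inversion matters because iGraph interprets edge weights as costs/lengths. A minimal hypothetical triangle shows raw ‘friendship’ weights steering shortest paths the wrong way:

```r
library(igraph)

g <- graph_from_literal(a - b, b - c, a - c)
E(g)$weight <- c(10, 10, 1)  # a-b and b-c are strong ties, a-c is a weak one

# Raw weights: the strong ties look expensive, so the path goes straight a -> c
shortest_paths(g, "a", to = "c", weights = E(g)$weight)$vpath[[1]]

# Inverted weights: strong ties become short distances, so the path goes via b
shortest_paths(g, "a", to = "c", weights = 1 / E(g)$weight)$vpath[[1]]
```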
#topological closeness
clos_top_david=closeness(g_david, weights = NA)
normalised_clos_top_david=(clos_top_david-min(clos_top_david))/(max(clos_top_david)-min(clos_top_david))
palette_top=hsv(h=1-((normalised_clos_top_david*2/3)+1/3),s = 1,v=1)
plot(g_david,vertex.color=palette_top,vertex.size=normalised_clos_top_david*15,vertex.label.cex=.5)
title(main = "Topological closeness centrality")
Betweenness centrality
The betweenness centrality of a vertex is the number of shortest paths, among all pairs of vertices, that pass through it. Edge betweenness is defined similarly, counting the shortest paths that pass along a given edge.
Define \(n^{i}_{st}\) as: \[n^{i}_{st} = \begin{cases} 1, & \text{if vertex }i\text{ lies on the geodesic path from }s\text{ to }t\\ 0, & \text{otherwise} \end{cases}\]
Then betweenness centrality can be defined as: \[x_{i} = \sum_{st}n^{i}_{st}\]
However, there may be multiple geodesics from \(s\) to \(t\) so to account for this we normalise by the number of geodesics from \(s\) to \(t\), \(g_{st}\): \[x_{i} = \sum_{st}\frac{n^{i}_{st}}{g_{st}}\]
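The definition can be verified on a hypothetical star graph: every geodesic between two leaves passes through the centre, so with 4 leaves the centre’s betweenness is \(\binom{4}{2} = 6\) and each leaf’s is 0.

```r
library(igraph)

# Undirected star: node 1 is the centre, nodes 2-5 are leaves
g <- make_star(5, mode = "undirected", center = 1)

betweenness(g)  # 6 0 0 0 0
```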
#Let us compute the betweenness centrality for the network
bet_david=betweenness(g_david, v=V(g_david), directed = F, normalized = FALSE, weights = (E(g_david)$weight))
palette=hsv(h=1-((bet_david/max(bet_david)*2/3)+1/3),s = 1,v=1)
plot(g_david,vertex.color=palette,vertex.size=bet_david/max(bet_david)*18,vertex.label.cex=.5)
title(main = "Broker of David's network seen by Yaan")
This is a rather nice result: the grandma is the broker of the system!
Note that such a nice result fooled us. Why? What needs to be amended?
Let us look at the topological betweenness and see whether we get any insights:
#Topological betweenness centrality for the network
bet_david_top=betweenness(g_david, v=V(g_david), directed = F, normalized = FALSE, weights = NA)
palette_top=hsv(h=1-((bet_david_top/max(bet_david_top)*2/3)+1/3),s = 1,v=1)
plot(g_david,vertex.color=palette_top,vertex.size=bet_david_top/max(bet_david_top)*18,vertex.label.cex=.5)
title(main = "Broker (topological) of David's network seen by Yaan")
There are 2 lessons to learn from this:
assigning arbitrary numbers to relationships can lead to misleading results
we need to be careful when using weights: what are they encoding? how are they representing the system? how can we integrate them correctly?
In order to talk about clustering in networks, one needs to introduce a set of concepts, such as the clustering coefficient, modularity, and assortativity, which would need more time than the few remaining minutes we have. We will therefore give you “a taster” of this field here, since community detection is a field in its own right within network science.
Assortativity, disassortativity
include_graphics(".\\img\\networks2\\assort.png")
include_graphics(".\\img\\networks2\\assort2.png")
Let us look at another example where we have no more information about the nodes than what you see below.
include_graphics(".\\img\\networks2\\assort_2nets_plus.png")
include_graphics(".\\img\\networks2\\assort_null.png")
include_graphics(".\\img\\networks2\\modularity.png")
include_graphics(".\\img\\networks2\\comm_detection.png")
There are two types of hierarchical clustering algorithm: agglomerative and divisive.
–> Both methods are based on assigning a measure to each link, say a similarity measure between nodes or a centrality measure for the link. Then, depending on the algorithm, either nodes connected by high-value links (e.g. high similarity) are merged, or links with low value (e.g. low similarity) are removed.
The process for hierarchical clustering is as follows:
Define a similarity measure, or a centrality measure
Compute the similarity or centrality measure for the network
Iteratively merge nodes or remove links until all nodes are merged, or all links are removed
Build a dendrogram of the process
Identify the cut in the tree that maximises the modularity
The following slides taken from Barabasi’s lecture for the Girvan-Newman algorithm illustrate the method very clearly
include_graphics(".\\img\\networks2\\G_N1.png")
include_graphics(".\\img\\networks2\\G_N2.png")
Karate Club network
In this section we apply some of the community detection algorithms available in iGraph to Zachary’s Karate Club network. Look it up: it is the most famous network for testing community detection algorithms.
#It's so famous it's already in iGraph
g_karate <- make_graph("Zachary")
#let us now produce a plot such that the nodes in the plot have a size related to their degree
V(g_karate)$size=degree(g_karate)
plot(g_karate,vertex.label=NA)
title('Karate club friendship network', cex.main = 1.25,line=2)
title('Zachary 1977', cex.main = 1, line=1)
Girvan-Newman method
M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci. U.S.A. 99, 7821 (2002)
M. E. Newman and M. Girvan, Physical Review E 69, 026113 (2004)
iGraph: edge.betweenness.community
http://igraph.org/r/doc/community.edge.betweenness.html
http://www.inside-r.org/packages/cran/igraph/docs/edge.betweenness.community
How does it work?
Hierarchical divisive algorithm: remove edges (links) according to their edge betweenness values, in decreasing order.
Removal stops when the modularity of the partition is maximised.
–> Edges with high betweenness will most likely be the ones connecting different groups, rather than lying within them.
Performance
Good
slow: O(N³), since betweenness needs to be re-calculated each time an edge is removed
Weighted graphs? Yes
Directed graphs? Yes
#------------
#Girvan-Newman algorithm
#------------
alg_name='Girvan-Newman'
GN_comm <- edge.betweenness.community(g_karate)
#look at result of algorithm
print(GN_comm)
## IGRAPH clustering edge betweenness, groups: 5, mod: 0.4
## + groups:
## $`1`
## [1] 1 2 4 8 12 13 14 18 20 22
##
## $`2`
## [1] 3 25 26 28 29 32
##
## $`3`
## [1] 5 6 7 11 17
##
## $`4`
## + ... omitted several groups/vertices
#the process aims at maximising the modularity
modularity(GN_comm)
## [1] 0.4012985
# number of communities and their sizes
nc=length(GN_comm)
sizes(GN_comm)
## Community sizes
## 1 2 3 4 5
## 10 6 5 12 1
#if you need the membership vector only
mem_vec=membership(GN_comm)
#visualise the results
plot(GN_comm,g_karate,layout=layout.fruchterman.reingold)
title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))
#The division into communities is determined by the choice of location for cutting the hierarchical tree.
#The threshold is selected such that the modularity is maximised
#Let us look at this through the dendrogram
dendPlot(GN_comm,use.modularity = TRUE)
title(paste0('Community structure dendrogram for ',alg_name,' method \n',nc,' communities for the karate club'))
# !!!!!! This method is based on computing betweenness in an iterative way, and hence it is extremely slow for large networks.
Fast Greedy modularity optimization: Clauset, Newman and Moore
A. Clauset, M. E. J. Newman, and C. Moore, Phys. Rev. E 70, 066111 (2004)
http://www.arxiv.org/abs/cond-mat/0408187
iGraph: fastgreedy.community
http://igraph.org/r/doc/fastgreedy.community.html
http://www.inside-r.org/packages/cran/igraph/docs/fastgreedy.community
How does it work?
Assign each node to its own community, no links between communities
Inspect each community pair connected by a link, and assign them to the same community if the link increases maximally the Girvan-Newman modularity.
Bottom-up instead of top-down: communities are merged iteratively such that each merge is locally optimal
Performance
Not as good as Girvan-Newman
fast: O(N log² N)
Resolution limit
Weighted graphs? Yes
Directed graphs? No
#------------
#Fast greedy modularity optimisation: Clauset-Newman-Moore
#------------
alg_name='Clauset-Newman-Moore'
greedy_c <- fastgreedy.community(g_karate)
#look at result of algorithm
print(greedy_c)
## IGRAPH clustering fast greedy, groups: 3, mod: 0.38
## + groups:
## $`1`
## [1] 1 5 6 7 11 12 17 20
##
## $`2`
## [1] 9 15 16 19 21 23 24 25 26 27 28 29 30 31 32 33 34
##
## $`3`
## [1] 2 3 4 8 10 13 14 18 22
##
#the process aims at maximising the modularity
modularity(greedy_c)
## [1] 0.3806706
# number of communities and their sizes
nc=length(greedy_c)
sizes(greedy_c)
## Community sizes
## 1 2 3
## 8 17 9
#if you need the membership vector only
mem_vec=membership(greedy_c)
#visualise the results
plot(greedy_c,g_karate)#,layout=layout.fruchterman.reingold)
title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))
#The division into communities is determined by the choice of location for cutting the hierarchical tree.
#The threshold is selected such that the modularity is maximised
#Let us look at this through the dendrogram
dendPlot(greedy_c,use.modularity = TRUE)
title(paste0('Community structure dendrogram for ',alg_name,' method \n',nc,' communities for the karate club'))
Random walk: Pons & Latapy
Pascal Pons, Matthieu Latapy: Computing communities in large networks using random walks, http://arxiv.org/abs/physics/0512106
iGraph: walktrap.community
How does it work?
This function tries to find densely connected subgraphs using short random walks: a short random walk tends to stay within the same community.
The modularity score can be used to select where to cut the dendrogram.
Performance
Weighted graphs? Yes
Directed graphs? No
#------------
#Community strucure via short random walks: Pons&Latapy
#------------
alg_name='Random walks'
wtrap_c <- walktrap.community(g_karate)
comm=wtrap_c
#look at result of algorithm
print(comm)
## IGRAPH clustering walktrap, groups: 5, mod: 0.35
## + groups:
## $`1`
## [1] 1 2 4 8 12 13 18 20 22
##
## $`2`
## [1] 3 9 10 14 29 31 32
##
## $`3`
## [1] 15 16 19 21 23 27 30 33 34
##
## $`4`
## + ... omitted several groups/vertices
#the process aims at maximising the modularity
modularity(comm)
## [1] 0.3532216
# number of communities and their sizes
nc=length(comm)
sizes(comm)
## Community sizes
## 1 2 3 4 5
## 9 7 9 4 5
#if you need the membership vector only
mem_vec=membership(comm)
#visualise the results
plot(comm,g_karate)#,layout=layout.fruchterman.reingold)
title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))
#The division into communities is determined by the choice of location for cutting the hierarchical tree.
#The threshold is selected such that the modularity is maximised
#Let us look at this through the dendrogram
dendPlot(comm,use.modularity = TRUE)
title(paste0('Community structure dendrogram for ',alg_name,' method \n',nc,' communities for the karate club'))
Leading eigenvector: Newman spectral approach
MEJ Newman: Finding community structure using the eigenvectors of matrices, Physical Review E 74 036104, (2006)
iGraph: leading.eigenvector.community
http://igraph.org/r/doc/leading.eigenvector.html
http://www.inside-r.org/packages/cran/igraph/docs/leading.eigenvector.community
How does it work?
Method based on the spectral properties of graphs: if communities are well identified, the eigenvector components for nodes in the same community will have similar values
Compute the eigenvector of the modularity matrix for the largest positive eigenvalue
Separate vertices into two communities based on the sign of the corresponding element in the eigenvector; if all elements have the same sign, the network has no underlying community structure
Choose the partition that maximises the G-N modularity
Performance
Fast if only a few eigenvectors are computed, although slower than the greedy method
Good results
Weighted graphs? No
Directed graphs? No
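The first split can be sketched directly, assuming an undirected, unweighted graph: build the modularity matrix \(B = A - \frac{k k^T}{2m}\) and bisect by the sign of its leading eigenvector. This is an illustration of the idea only, not iGraph’s full implementation (which recurses and refines the split):

```r
library(igraph)

g <- make_graph("Zachary")
A <- as.matrix(get.adjacency(g))
k <- degree(g)
m <- ecount(g)

# Modularity matrix B = A - k k^T / (2m)
B <- A - outer(k, k) / (2 * m)

# eigen() on a symmetric matrix returns eigenvalues in decreasing order,
# so column 1 is the eigenvector of the largest eigenvalue
v <- eigen(B, symmetric = TRUE)$vectors[, 1]

# First bisection: group membership by sign of the eigenvector component
table(v >= 0)
```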
#------------
#Leading eigenvector: Newman spectral approach
#------------
alg_name='Spectral'
spectral_c <- leading.eigenvector.community(g_karate)
comm=spectral_c
#look at result of algorithm
print(comm)
## IGRAPH clustering leading eigenvector, groups: 4, mod: 0.39
## + groups:
## $`1`
## [1] 1 5 6 7 11 12 17
##
## $`2`
## [1] 9 10 15 16 19 21 23 27 30 31 33 34
##
## $`3`
## [1] 2 3 4 8 13 14 18 20 22
##
## $`4`
## + ... omitted several groups/vertices
#the process aims at maximising the modularity
modularity(comm)
## [1] 0.3934089
# number of communities and their sizes
nc=length(comm)
sizes(comm)
## Community sizes
## 1 2 3 4
## 7 12 9 6
#if you need the membership vector only
mem_vec=membership(comm)
#visualise the results
plot(comm,g_karate)#,layout=layout.fruchterman.reingold)
title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))
#This method does not produce a dendrogram
Louvain: Blondel et al, modularity optimization
V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, J. Stat. Mech.: Theory Exp. 2008, P10008.
http://arxiv.org/abs/arXiv:0803.0476
iGraph: multilevel.community
http://igraph.org/r/doc/multilevel.community.html
http://www.inside-r.org/packages/cran/igraph/docs/multilevel.community
How does it work?
Local optimization of the G-N modularity in the neighbourhood of each node
Start from the whole network, identify communities, replace each community by a super-node –> obtain a smaller weighted network, and repeat
Iterate until the modularity is stable
Performance
Excellent
Computational complexity linear in the number of nodes
Weighted graphs? Yes
Directed graphs? No
#------------
#Louvain method: Blondel et al, modularity optimization
#------------
alg_name='Louvain'
louv_c <- multilevel.community(g_karate)
comm=louv_c
#look at result of algorithm
print(comm)
## IGRAPH clustering multi level, groups: 4, mod: 0.42
## + groups:
## $`1`
## [1] 5 6 7 11 17
##
## $`2`
## [1] 1 2 3 4 8 10 12 13 14 18 20 22
##
## $`3`
## [1] 24 25 26 28 29 32
##
## $`4`
## + ... omitted several groups/vertices
#the process aims at maximising the modularity
modularity(comm)
## [1] 0.4188034
# number of communities and their sizes
nc=length(comm)
sizes(comm)
## Community sizes
## 1 2 3 4
## 5 12 6 11
#if you need the membership vector only
mem_vec=membership(comm)
#visualise the results
plot(comm,g_karate)#,layout=layout.fruchterman.reingold)
title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))
Infomap: Rosvall & Bergstrom
M. Rosvall and C. T. Bergstrom, Maps of information flow reveal community structure in complex networks, PNAS 105, 1118 (2008)
M. Rosvall, D. Axelsson, and C. T. Bergstrom, The map equation, Eur. Phys. J. Special Topics 178, 13 (2009).
iGraph: infomap.community
http://igraph.org/r/doc/infomap.html
http://www.inside-r.org/packages/cran/igraph/docs/infomap.community
How does it work?
Find community structure that minimizes the expected description length of a random walker trajectory
Performance
Weighted graphs? Yes
Directed graphs? Yes
#------------
#Infomap method: Rosvall and Bergstrom
#------------
alg_name='Infomap'
info_c <- infomap.community(g_karate)
comm=info_c
#look at result of algorithm
print(comm)
## IGRAPH clustering infomap, groups: 3, mod: 0.4
## + groups:
## $`1`
## [1] 9 15 16 19 21 23 24 25 26 27 28 29 30 31 32 33 34
##
## $`2`
## [1] 1 2 3 4 8 10 12 13 14 18 20 22
##
## $`3`
## [1] 5 6 7 11 17
##
#the process aims at maximising the modularity
modularity(comm)
## [1] 0.4020381
# number of communities and their sizes
nc=length(comm)
sizes(comm)
## Community sizes
## 1 2 3
## 17 12 5
#if you need the membership vector only
mem_vec=membership(comm)
#visualise the results
plot(comm,g_karate)#,layout=layout.fruchterman.reingold)
title(paste0(alg_name,' algorithm \n',nc,' communities for the karate club'))
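The three algorithms above give somewhat different partitions of the karate club. iGraph's compare() function quantifies how similar two memberships are, for example via normalised mutual information (NMI: 1 = identical partitions, 0 = independent). A minimal, self-contained sketch (the similarity values are not shown because the exact assignments can vary between igraph versions):

```r
library(igraph)
# Zachary's karate club ships with igraph, so this sketch is self-contained
g_karate <- graph.famous("Zachary")
spectral_c <- leading.eigenvector.community(g_karate)
louv_c     <- multilevel.community(g_karate)
info_c     <- infomap.community(g_karate)
# Pairwise similarity of the three partitions (1 = identical)
compare(spectral_c, louv_c, method = "nmi")
compare(louv_c, info_c, method = "nmi")
compare(spectral_c, info_c, method = "nmi")
```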
include_graphics(".\\img\\networks2\\table_mod.png")
include_graphics(".\\img\\networks2\\Expert.png")
include_graphics(".\\img\\networks2\\Expert2.png")
This session will cover the following:
The code presented here follows the methods used in The Product Space Conditions the Development of Nations By C. A. Hidalgo, B. Klinger, A.-L. Barabási, R. Hausmann, Science Jul 2007
It’s a great paper and worth a read.
You can access world import and export data from the United Nations here: https://cid.econ.ucdavis.edu/nberus.html.
See page 48 of the documentation (.\data\wtf99\NBER-UN_Data_Documentation_w11040.pdf) for an explanation of the variables.
# Load world import and export data from 1999. We restrict our analysis to a single year for ease, but it would be better to use all available years.
dfWorldTrade <- read.dta13(".\\data\\wtf99\\wtf99.dta")
head(dfWorldTrade)
## year icode importer ecode exporter sitc4 unit dot value quantity
## 1 1999 100000 World 100000 World NA 3854744 NA
## 2 1999 100000 World 100000 World 0011 NA 1041246 NA
## 3 1999 100000 World 100000 World 0011 N NA 388392 1970655
## 4 1999 100000 World 100000 World 0011 W NA 2537670 1357983
## 5 1999 100000 World 100000 World 0012 NA 36635 NA
## 6 1999 100000 World 100000 World 0012 N NA 36197 669442
Products are differentiated by their sitc4 code
# 1263 different product codes; this includes 1 blank code, so 1262 in practice
length(unique(dfWorldTrade$sitc4))
## [1] 1263
# 188 unique exporter countries
length(unique(dfWorldTrade$ecode))
## [1] 188
Data cleaning
# Remove invalid product codes and select only the columns we are interested in
dfWorldTrade <- dfWorldTrade[dfWorldTrade$sitc4 !="",c("ecode","sitc4","value")]
# Remove entries where the exporter is 'World'. The World category is assigned to imports/exports with missing data.
dfWorldTrade <- dfWorldTrade[dfWorldTrade$ecode != "100000",]
head(dfWorldTrade)
## ecode sitc4 value
## 2301 117100 0011 2262
## 2302 117100 0012 131
## 2303 117100 0014 556
## 2304 117100 0015 2422
## 2305 117100 0111 278
## 2306 117100 0111 117149
Group by country and product and calculate the value of each country’s exports by product
# Aggregate the data
dfWTCountryAgg <- aggregate(dfWorldTrade$value, by = list(dfWorldTrade$ecode), sum, na.rm = TRUE)
dfWTCountryProdAgg <- aggregate(dfWorldTrade$value, by = list(dfWorldTrade$ecode, dfWorldTrade$sitc4), sum, na.rm = TRUE)
dfWTProductAgg <- aggregate(dfWorldTrade$value, by = list(dfWorldTrade$sitc4), sum, na.rm = TRUE)
globalSum <- sum(dfWorldTrade$value)
# Rename the columns
colnames(dfWTCountryAgg) <- c("ecode", "etotal")
colnames(dfWTProductAgg) <- c("sitc4", "ptotal")
colnames(dfWTCountryProdAgg) <- c("ecode", "sitc4", "eptotal")
head(dfWTCountryAgg)
## ecode etotal
## 1 117100 63359546
## 2 130120 26996016
## 3 134340 17184250
## 4 135040 16460714
## 5 137360 1173686
## 6 137880 12947674
head(dfWTProductAgg)
## sitc4 ptotal
## 1 0011 7934616
## 2 0012 1597522
## 3 0013 2496742
## 4 0014 1814336
## 5 0015 3247458
## 6 0019 1920
head(dfWTCountryProdAgg)
## ecode sitc4 eptotal
## 1 117100 0011 4524
## 2 167060 0011 16028
## 3 211240 0011 1456330
## 4 218400 0011 422514
## 5 330320 0011 7852
## 6 330760 0011 1982
Join these datasets together and use them to calculate RCA (revealed comparative advantage)
dfRCA <- merge(dfWTCountryProdAgg, dfWTCountryAgg, by ="ecode")
dfRCA <- merge(dfRCA, dfWTProductAgg, by="sitc4")
dfRCA$prod_in_country_share <- dfRCA$eptotal / dfRCA$etotal
dfRCA$prod_in_global_share <- dfRCA$ptotal / globalSum
dfRCA$rca = dfRCA$prod_in_country_share / dfRCA$prod_in_global_share
head(dfRCA)
## sitc4 ecode eptotal etotal ptotal prod_in_country_share
## 1 0011 117100 4524 63359546 7934616 7.140203e-05
## 2 0011 457020 646 178684928 7934616 3.615302e-06
## 3 0011 532500 2423430 601149100 7934616 4.031329e-03
## 4 0011 338580 25338 4825320 7934616 5.251051e-03
## 5 0011 451160 5612 2336744 7934616 2.401632e-03
## 6 0011 530560 300186 287522372 7934616 1.044044e-03
## prod_in_global_share rca
## 1 0.0006855073 0.104159392
## 2 0.0006855073 0.005273907
## 3 0.0006855073 5.880796810
## 4 0.0006855073 7.660094153
## 5 0.0006855073 3.503437883
## 6 0.0006855073 1.523023789
RCA > 1 means that this product makes up a greater share of this country’s exports than it does of world exports as a whole, i.e. the country exports it more intensively than the “average” country.
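As a sanity check on the interpretation, RCA can be computed by hand for a toy case (made-up numbers, not from the trade data):

```r
# Toy example: product A is 10% of the country's exports
# but only 2% of world trade
prod_in_country_share <- 1000 / 10000    # country exports 1000 of A out of 10000 total
prod_in_global_share  <- 20000 / 1000000 # A is 20000 of 1000000 in world trade
rca <- prod_in_country_share / prod_in_global_share
rca  # 5: the country is specialised in A (RCA > 1)
```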
To calculate the proximity between products we use RCA as an indication of whether a country exports a product or not. If a country has an RCA > 1 for a product it is considered to export that product.
Using this, the conditional probability of a product being exported, given that another product is exported, is:
\[P(product_{i} | product_{j}) = \frac{\text{number of countries exporting i and j}}{\text{number of countries exporting j}}\]
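The same calculation on a toy binary export matrix (rows = countries, columns = products; made-up data) makes the formula concrete:

```r
# Rows are countries, columns are products; 1 = RCA > 1 for that product
M <- matrix(c(1, 1,
              1, 0,
              0, 1,
              1, 1), ncol = 2, byrow = TRUE,
            dimnames = list(NULL, c("i", "j")))
n_both <- sum(M[, "i"] == 1 & M[, "j"] == 1)  # countries exporting both i and j
n_j    <- sum(M[, "j"] == 1)                  # countries exporting j
p_i_given_j <- n_both / n_j
p_i_given_j  # 2/3: of the 3 exporters of j, 2 also export i
```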
# Number of countries exporting each product
dfRCA$is_exported <- as.numeric(dfRCA$rca > 1)
dfNExport <- aggregate(dfRCA$is_exported, by = list(dfRCA$sitc4), sum)
colnames(dfNExport)[2] <- "nexporters"
head(dfNExport)
## Group.1 nexporters
## 1 0011 21
## 2 0012 24
## 3 0013 13
## 4 0014 15
## 5 0015 20
## 6 0019 4
# Number of countries exporting product pairs
# This block takes a long time to run. To save time (and avoid crashing your computer), you can read the data from csv using the next block instead.
# Merge on country code to pair up products exported by the same country, then count the number of countries that export both products
#dfPairs <- merge(dfRCA, dfRCA, by = "ecode")
#dfPairs$joint_export <- as.numeric((dfPairs$is_exported.x == 1) & (dfPairs$is_exported.y ==1))
#dfNJointExport <- aggregate(dfPairs$joint_export, by = list(dfPairs$sitc4.x,dfPairs$sitc4.y), sum)
#colnames(dfNJointExport)[3] <- "njointexporters"
#write.csv(dfNJointExport, ".\\data\\product_joint_export_counts.csv", row.names=FALSE)
#head(dfNJointExport)
dfNJointExport <- read.csv(".\\data\\product_joint_export_counts.csv")
head(dfNJointExport)
## Group.1 Group.2 njointexporters
## 1 0011 0011 21
## 2 0012 0011 9
## 3 0013 0011 5
## 4 0014 0011 4
## 5 0015 0011 6
## 6 0019 0011 1
Use these two tables to compute the conditional probabilities
dfConditional <- merge(dfNJointExport, dfNExport, by = 'Group.1')
dfConditional$pij <- dfConditional$njointexporters / dfConditional$nexporters
head(dfConditional)
## Group.1 Group.2 njointexporters nexporters pij
## 1 0011 0011 21 21 1.00000000
## 2 0011 6424 7 21 0.33333333
## 3 0011 01AA 1 21 0.04761905
## 4 0011 2772 3 21 0.14285714
## 5 0011 4313 5 21 0.23809524
## 6 0011 842X 1 21 0.04761905
Now identify and select the minimum conditional probability by joining the data frame of conditional probabilities to itself.
# Merge the dfConditional dataframe to itself so that we can compare the joint probabilities P(export i | export j) to P(export j | export i)
dfProximity <- merge(dfConditional, dfConditional, by.x = c("Group.1","Group.2"), by.y = c("Group.2", "Group.1"))
dfProximity$pij.x.islower <- dfProximity$pij.x < dfProximity$pij.y
head(dfProximity)
## Group.1 Group.2 njointexporters.x nexporters.x pij.x
## 1 0011 0011 21 21 1.00000000
## 2 0011 0012 9 21 0.42857143
## 3 0011 0013 5 21 0.23809524
## 4 0011 0014 4 21 0.19047619
## 5 0011 0015 6 21 0.28571429
## 6 0011 0019 1 21 0.04761905
## njointexporters.y nexporters.y pij.y pij.x.islower
## 1 21 21 1.0000000 FALSE
## 2 9 24 0.3750000 FALSE
## 3 5 13 0.3846154 TRUE
## 4 4 15 0.2666667 TRUE
## 5 6 20 0.3000000 TRUE
## 6 1 4 0.2500000 TRUE
# We can check this merge has given us the right thing by checking the conditional probability value for Group.1 = "0015" and Group.2 = "0011" has been merged with the value for Group.1 = "0011" and Group.2 = "0015"
dfConditional[(dfConditional$Group.1=="0015") & (dfConditional$Group.2 == "0011"),]
## Group.1 Group.2 njointexporters nexporters pij
## 5393 0015 0011 6 20 0.3
# Now select the lower conditional probabilities. The lower conditional probability is used as the proximity between two products
dfProximity1 <- dfProximity[dfProximity$pij.x.islower ==TRUE, c("Group.1", "Group.2", "pij.x")]
dfProximity2 <- dfProximity[dfProximity$pij.x.islower ==FALSE, c("Group.1", "Group.2", "pij.y")]
colnames(dfProximity1) <- c("product1","product2", "proximity")
colnames(dfProximity2) <- c("product1","product2", "proximity")
dfProximity <- rbind(dfProximity1, dfProximity2)
# Drop entries where product1 == product2 as the proximity is trivially 1
dfProximity <- dfProximity[dfProximity$product1!=dfProximity$product2,]
# Check for duplicates
any(duplicated(dfProximity[,1:2]))
## [1] FALSE
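As an aside, selecting the smaller of the two conditional probabilities can also be done in one step with base R's pmin(), which takes the element-wise minimum of two vectors. A sketch on toy vectors standing in for the pij.x and pij.y columns (the self-merge above does the equivalent row matching):

```r
# Element-wise minimum of the two conditional probabilities
pij_x <- c(0.43, 0.24, 0.19)  # toy values for P(i | j)
pij_y <- c(0.38, 0.38, 0.27)  # toy values for P(j | i)
pmin(pij_x, pij_y)  # 0.38 0.24 0.19
```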
This dataframe can be used to build a network where the nodes are the products and the edge weights between products are the proximity.
head(dfProximity)
## product1 product2 proximity
## 3 0011 0013 0.23809524
## 4 0011 0014 0.19047619
## 5 0011 0015 0.28571429
## 6 0011 0019 0.04761905
## 10 0011 0112 0.19047619
## 11 0011 0113 0.38095238
Create an iGraph object from this edge list
graphProdSpace <- graph_from_data_frame(dfProximity, directed = FALSE)
This is a very dense graph so plotting the whole thing will not be helpful. In the Hidalgo (2007) paper the authors describe the variety of methods they used to represent the network visually and produce a figure like that in the slide below. See the supporting online material for details: https://science.sciencemag.org/content/sci/suppl/2007/07/26/317.5837.482.DC1/Hidalgo.SOM.pdf
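One of the techniques described in the supporting material is to draw the network's maximum spanning tree as a backbone and then overlay the strongest remaining links. iGraph provides minimum.spanning.tree() only, so a common trick is to transform the weights (here 1 - proximity, since proximity lies in [0, 1]) so that the strongest edges become the cheapest. A sketch on a small made-up proximity graph, not the real product space:

```r
library(igraph)
# Small made-up proximity edge list
df <- data.frame(from = c("A", "A", "B", "C"),
                 to   = c("B", "C", "C", "D"),
                 proximity = c(0.9, 0.2, 0.5, 0.7))
g <- graph_from_data_frame(df, directed = FALSE)
# Minimising 1 - proximity keeps the highest-proximity edges,
# i.e. it yields the maximum spanning tree of the proximity network
backbone <- minimum.spanning.tree(g, weights = 1 - E(g)$proximity)
get.edgelist(backbone)  # keeps A-B, B-C, C-D; drops the weak A-C edge
```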
Calculate ‘density’ around products for a single country (e.g. the UK)
# First subset the product space to consider only the products exported by a particular country, e.g. the UK
# The data documentation (.\\data\\wtf99\\NBER-UN_Data_Documentation_w11040.pdf) contains a lookup from country name to country code. The UK is 538260
# Get all products exported by the UK
ukExports <- dfRCA[(dfRCA$is_exported == 1) & (dfRCA$ecode == "538260"),"sitc4"]
# Using dfProximity as the edge list, remove connections to nodes not exported by the UK
dfProximityUK <- dfProximity[dfProximity$product2 %in% ukExports,]
# Now aggregate by product to get the weight contribution from UK exported products only
dfStrengthUK <- aggregate(dfProximityUK$proximity, by = list(dfProximityUK$product1), sum)
colnames(dfStrengthUK) <- c("product","strengthUK")
head(dfStrengthUK)
## product strengthUK
## 1 0011 52.28548
## 2 0012 54.28734
## 3 0013 47.37124
## 4 0014 82.83941
## 5 0015 63.04949
## 6 0019 13.22206
# The denominator of a node's density is the sum of its edge weights, known as the node strength. This is similar to the node degree but weighted by proximity; iGraph's strength function computes it
strengthProdSpace <- strength(graphProdSpace,weights = get.edge.attribute(graphProdSpace)$proximity)
dfStrength <- data.frame(product = names(strengthProdSpace), strength = strengthProdSpace)
rownames(dfStrength) <- c(1:length(dfStrength$product))
head(dfStrength)
## product strength
## 1 0011 334.12564
## 2 0012 303.94074
## 3 0013 314.03959
## 4 0014 420.02677
## 5 0015 321.33070
## 6 0019 70.90681
# Now combine the tables and identify the high-density products the UK does not export
dfStrength <- merge(dfStrength, dfStrengthUK)
dfStrength$density <- dfStrength$strengthUK / dfStrength$strength
# Now select just the products that the UK doesn't export and rank by density
dfUKNotExp <- dfStrength[!(dfStrength$product %in% ukExports),]
head(dfUKNotExp[with(dfUKNotExp, order(density, decreasing = TRUE)),])
## product strength strengthUK density
## 450 524A 58.77814 11.14369 0.1895890
## 1089 791X 58.77814 11.14369 0.1895890
## 430 5169 381.65871 71.24523 0.1866726
## 6 0019 70.90681 13.22206 0.1864710
## 473 5416 411.47924 75.92215 0.1845103
## 1027 7741 317.77398 58.46509 0.1839833
What products are these?
# The descriptions for revision 1 and 2 of the SITC codes can be downloaded from here: https://unstats.un.org/unsd/tradekb/Knowledgebase/50262/Commodity-Indexes-for-the-Standard-International-Trade-Classification-Revision-3
# The codes used here are the 4th revision but we will use the descriptions from the 2nd revision for ease.
# The 4th revision is detailed here: https://unstats.un.org/unsd/publications/catalogue?selectID=104
dfSITC <- read.csv(".\\data\\SITC Rev2.csv")
colnames(dfSITC) <- c("commodity.code", "commodity.description")
head(dfSITC)
## commodity.code commodity.description
## 1 0 Food and live animals chiefly for food
## 2 00 Live animals chiefly for food
## 3 001 Live animals chiefly for food
## 4 0011 Animals of the bovine species (including buffaloes), live
## 5 00111 -- pure bred for breeding
## 6 00119 -- other than pure bred for breeding
# Add product descriptions into the strength dataframe
dfStrength$product_group <- substr(dfStrength$product, start = 1, stop = 2)
dfStrength <- merge(dfStrength, dfSITC, by.x = c("product"), by.y = c("commodity.code"))
dfStrength <- merge(dfStrength, dfSITC, by.x = c("product_group"), by.y = c("commodity.code"))
colnames(dfStrength)[6:7] <- c("product_description","group_description")
head(dfStrength)
## product_group product strength strengthUK density
## 1 00 0011 334.12564 52.28548 0.1564845
## 2 00 0012 303.94074 54.28734 0.1786116
## 3 00 0013 314.03959 47.37124 0.1508448
## 4 00 0014 420.02677 82.83941 0.1972241
## 5 00 0015 321.33070 63.04949 0.1962137
## 6 00 0019 70.90681 13.22206 0.1864710
## product_description
## 1 Animals of the bovine species (including buffaloes), live
## 2 Sheep and goats, live
## 3 Swine, live
## 4 Poultry, live
## 5 Equine species, live
## 6 Live animals of a kind mainly used for human food, nes
## group_description
## 1 Live animals chiefly for food
## 2 Live animals chiefly for food
## 3 Live animals chiefly for food
## 4 Live animals chiefly for food
## 5 Live animals chiefly for food
## 6 Live animals chiefly for food
# Now select just the products that the UK doesn't export and rank by density
dfUKNotExp <- dfStrength[!(dfStrength$product %in% ukExports),]
head(dfUKNotExp[with(dfUKNotExp, order(density, decreasing = TRUE)),])
## product_group product strength strengthUK density
## 268 51 5169 381.65871 71.24523 0.1866726
## 6 00 0019 70.90681 13.22206 0.1864710
## 292 54 5416 411.47924 75.92215 0.1845103
## 633 77 7741 317.77398 58.46509 0.1839833
## 715 87 8720 297.62421 54.32069 0.1825144
## 634 77 7742 347.62119 63.27810 0.1820318
## product_description
## 268 Organic chemicals, nes
## 6 Live animals of a kind mainly used for human food, nes
## 292 Glycosides, glands, antisera, vaccines and similar products
## 633 Electro-medical equipment
## 715 Medical instruments and appliances, nes
## 634 X-ray apparatus and equipment; accessories; and parts, nes
## group_description
## 268 Organic chemicals
## 6 Live animals chiefly for food
## 292 Medicinal and pharmaceutical products
## 633 Electric machinery, apparatus and appliances, nes, and parts, nes
## 715 Professional, scientific, controlling instruments, apparatus, nes
## 634 Electric machinery, apparatus and appliances, nes, and parts, nes
# Which products that the UK *does* export have the highest density?
dfUKExp <- dfStrength[dfStrength$product %in% ukExports,]
head(dfUKExp[with(dfUKExp, order(density, decreasing = TRUE)),])
## product_group product strength strengthUK density
## 188 28 2860 98.87077 33.12124 0.3349953
## 606 75 7521 182.50646 49.63667 0.2719721
## 535 71 7149 224.58335 57.67624 0.2568144
## 674 79 7928 208.63288 53.46280 0.2562530
## 675 79 7929 215.60783 54.39888 0.2523047
## 121 22 2234 194.68738 48.29262 0.2480521
## product_description
## 188 Ores and concentrates of uranium and thorium
## 606 Analogue and hybrid data processing machines
## 535 Parts, nes of the engines and motors of group 714 and item 71888
## 674 Aircraft, nes and associated equipment
## 675 Parts, nes of the aircraft of heading 792
## 121 Linseed
## group_description
## 188 Metalliferous ores and metal scrap
## 606 Office machines and automatic data processing equipment
## 535 Power generating machinery and equipment
## 674 Other transport equipment
## 675 Other transport equipment
## 121 Oil seeds and oleaginous fruit